AITopics | word classification

Speech representation models based on the transformer architecture and trained by self-supervised learning have shown great promise for solving tasks such as speech and speaker recognition, keyword spotting, emotion detection, and more. Typically, it is found that larger models lead to better performance. However, the significant computational effort involved in such large transformer systems is a challenge for embedded and real-world applications. Recent work has shown that there is significant redundancy in the transformer models for NLP and massive layer pruning is feasible (Sajjad et al., 2023). Here, we investigate layer pruning in audio models. We base the pruning decision on a convexity criterion. Convexity of classification regions has recently been proposed as an indicator of subsequent fine-tuning performance in a range of application domains, including NLP and audio. In empirical investigations, we find a massive reduction in the computational effort with no loss of performance or even improvements in certain cases.

convexity, representation, speaker identification, (14 more...)

arXiv.org Artificial Intelligence

2408.11858

Country:

Europe > Denmark (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Visually grounded few-shot word acquisition with fewer shots

Nortje, Leanne, van Niekerk, Benjamin, Kamper, Herman

arXiv.org Artificial IntelligenceMay-25-2023

We propose a visually grounded speech model that acquires new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. We propose an approach that can work on natural word-image pairs but with less examples, i.e. fewer shots. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than any existing approach.

artificial intelligence, machine learning, utterance, (14 more...)

arXiv.org Artificial Intelligence

2305.15937

Country:

North America > Canada > Quebec > Montreal (0.04)
Africa > South Africa (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

Mitigating Catastrophic Forgetting for Few-Shot Spoken Word Classification Through Meta-Learning

van der Merwe, Ruan, Kamper, Herman

arXiv.org Artificial IntelligenceMay-22-2023

We consider the problem of few-shot spoken word classification in a setting where a model is incrementally introduced to new word classes. This would occur in a user-defined keyword system where new words can be added as the system is used. In such a continual learning scenario, a model might start to misclassify earlier words as newer classes are added, i.e. catastrophic forgetting. To address this, we propose an extension to model-agnostic meta-learning (MAML): each inner learning loop, where a model "learns how to learn'' new classes, ends with a single gradient update using stored templates from all the classes that the model has already seen (one template per class). We compare this method to OML (another extension of MAML) in few-shot isolated-word classification experiments on Google Commands and FACC. Our method consistently outperforms OML in experiments where the number of shots and the final number of classes are varied.

artificial intelligence, continual learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2305.1308

Country: Africa > South Africa (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

From Brain Waves to Real-Time Text Messaging

#artificialintelligenceNov-15-2022, 14:59:33 GMT

Posted on November 15th, 2022 by Lawrence Tabak, D.D.S., Ph.D. For people who have lost the ability to speak due to a severe disability, they want to get the words out. They just can't physically do it. But in our digital age, there is now a fascinating way to overcome such profound physical limitations. Computers are being taught to decode brain waves as a person tries to speak and then interactively translate them onto a computer screen in real time.

computer, paralysis, real-time text messaging, (11 more...)

#artificialintelligence

Country: North America > United States > California > San Francisco County > San Francisco (0.18)

Genre: Research Report (0.31)

Industry: Health & Medicine > Therapeutic Area (0.32)

Technology:

Information Technology > Artificial Intelligence (1.00)
Information Technology > Architecture > Real Time Systems (0.62)

Add feedback

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Chung, Yu-An, Weng, Wei-Hung, Tong, Schrasing, Glass, James

Neural Information Processing SystemsDec-31-2018

Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform the tasks of spoken word classification and translation, and the experimental results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. Our framework is especially useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages, which have little parallel audio-text data for training modern supervised ASR and speech-to-text translation models, but account for the majority of the languages spoken across the world.

audio segment, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: North America > United States > Massachusetts (0.28)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
(2 more...)

Add feedback

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Chung, Yu-An, Weng, Wei-Hung, Tong, Schrasing, Glass, James

Neural Information Processing SystemsDec-31-2018

Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform the tasks of spoken word classification and translation, and the experimental results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. Our framework is especially useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages, which have little parallel audio-text data for training modern supervised ASR and speech-to-text translation models, but account for the majority of the languages spoken across the world.

audio segment, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: North America > United States > Massachusetts (0.28)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
(2 more...)

Add feedback

What is Vector based word classification? • /r/MachineLearning

@machinelearnbotMay-27-2016, 08:25:45 GMT

What is Vector based word classification? Are you sure you don't mean support vector machine based word classification (like you did last time)? If not, then your boss probably means word2vec-style learning. That isn't really a classification method on its own, although it is possible to build classifiers on top of it. It's about a tool that mines texts from documents and then creates some kind of analysis based on that.

artificial intelligence, machine learning, word classification, (2 more...)

@machinelearnbot

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.63)

Add feedback

Performance Through Consistency: MS-TDNN's for Large Vocabulary Continuous Speech Recognition

Tebelskis, Joe, Waibel, Alex

Neural Information Processing SystemsDec-31-1993

Connectionist Rpeech recognition systems are often handicapped by an inconsistency between training and testing criteria. This problem is addressed by the Multi-State Time Delay Neural Network (MS-TDNN), a hierarchical phonf'mp and word classifier which uses DTW to modulate its connectivit.y

level training, recognition, speech recognition, (13 more...)

Neural Information Processing Systems

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Technology: